Abstract

In this paper, current dependency- 
based treebanks are introduced and 
analyzed. The methods used for 
building the resources, the annota- 
tion schemes applied, and the tools 
used (such as POS taggers, parsers 
and annotation software) are dis- 
cussed. 

1 Introduction 

Annotated data is a crucial resource for de- 
velopments in computational linguistics and 
natural language processing. Syntactically 
annotated corpora, treebanks, are needed for 
developing and evaluating natural language 
processing applications, as well as for
research in empirical linguistics. The choice
of annotation type in a treebank usually boils
down to two options: the linguistic resource
is annotated according to either a constituent
or a functional structure scheme. As
the name treebank suggests, these linguistic 
resources were first developed in the phrase- 
structure framework, usually represented as 
tree-shaped constructions. The first efforts 
to create such resources started around 30 
years ago. The best known of these
treebanks is the Penn Treebank for English
(Marcus et al., 1993).

In recent years, there has been wide
interest in functional annotation of
treebanks. In particular, many dependency-based
treebanks have been constructed. In addition,
grammatical function annotation has been
added to some constituent-type treebanks.
Dependency Grammar formalisms stem from
the work of Tesnière (1959). In dependency
grammars, only the lexical nodes are
recognized, and the phrasal ones are omitted. The
lexical nodes are linked with directed binary 
relations. The most commonly used argument 
for selecting the dependency format for build- 
ing a treebank is that the treebank is being 
created for a language with a relatively free 
word order. Such treebanks exist e.g. for 
Basque, Czech, German and Turkish. On 
the other hand, dependency treebanks have
also been developed for languages such as
English, which have usually been regarded as
better suited to constituent formalisms. The
motivations for using dependency annotation
range from the fact that this type of
structure is the one needed by many, if not
most, applications to the fact that it offers
a proper interface between syntactic and
semantic representation. Furthermore,
dependency structures can be automatically
converted into phrase structures if needed
(Lin, 1995; Xia and Palmer, 2000),
although not always with 100% accuracy.
The TIGER Treebank of German, a free
word order language, with 50,000 sentences
is an example of a treebank with both
phrase structure and dependency
annotations (Brants et al., 2002).
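
The dependency representation described above, lexical nodes linked by
directed binary relations, can be sketched as a small data structure. This
is only an illustration: the class name, the function name, and the relation
labels in the example are assumptions and are not taken from any of the
treebanks discussed in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # word form
    head: int    # index of the governing word, 0 for the root
    deprel: str  # relation label (illustrative, not from any treebank)


def is_well_formed(tokens: List[Token]) -> bool:
    """Return True if the head links form a single-rooted tree."""
    heads = {t.index: t.head for t in tokens}
    # Exactly one word may attach to the artificial root 0.
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        # Follow head links upwards; revisiting a node means a cycle.
        while node != 0:
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True


sentence = [
    Token(1, "The", 2, "det"),
    Token(2, "flight", 6, "subj"),
    Token(3, "will", 6, "aux"),
    Token(4, "have", 6, "aux"),
    Token(5, "been", 6, "aux"),
    Token(6, "booked", 0, "root"),
]
print(is_well_formed(sentence))
```

Because every word stores exactly one head, no phrasal nodes are needed, and
a check of this kind is enough to confirm that the links form a tree.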



The aim of this paper is to answer the
following questions about the current state
of the art in dependency treebanking:

• What kinds of texts do the treebanks 
consist of? 

• What types of annotation schemes and 
formats are applied? 

• What kinds of annotation methods and
tools are used for creating the treebanks?

• What kinds of functions do the
annotation tools for creating the treebanks
have?

We start by introducing the existing
dependency-based treebanks (Section 2). In
Section 3, the status and state of the art in
dependency treebanking are summarized and
analyzed. Finally, in Section 4, we summarize
our findings.

2 Existing dependency treebanks

2.1 Introduction

Several kinds of resources and tools are
needed for constructing a treebank:
annotation guidelines state the conventions that
guide the annotators throughout their work,
a software tool is needed to aid the
annotation work, and in the case of semi-automated
treebank construction, a part-of-speech (POS)
tagger, a morphological analyzer and/or a
syntactic parser are also needed. Building trees
manually is a very slow and error-prone
process. The most commonly used method for
developing a treebank is a combination of
automatic and manual processing, but the
practical implementation varies
considerably. Some treebanks
have been annotated completely manually,
but with taggers and parsers available to
automate some of the work, such a method is
rarely employed in state-of-the-art
treebanking.

2.2 The Treebanks

2.2.1 Prague Dependency Treebank

The largest of the existing dependency
treebanks (around 90,000 sentences), the Prague
Dependency Treebank for Czech, is
annotated in a layered structure
consisting of three levels: morphological,
analytical (syntax), and tectogrammatical
(semantics) (Böhmová et al., 2003). The data
consist of newspaper articles on diverse
topics (e.g. politics, sports, culture) and texts
from popular science magazines, selected from
the Czech National Corpus. There are
3,030 morphological tags in the
morphological tagset (Hajič, 1998). The syntactic
annotation comprises 23 dependency types.

The annotation for the levels was done
separately, by different groups of annotators.
The morphological tagging was performed by
two human annotators selecting the
appropriate tag from a list proposed by a
tagging system. A third annotator then resolved
any differences between the two annotations.
The syntactic annotation was at first done
completely manually, aided only by
ambiguous morphological tags and a graphical
user interface. Later, some functions for
automatically assigning part of the tags were
implemented. After some 19,000 sentences
were annotated, Collins' lexicalized stochastic
parser (Nelleke et al., 1999) was trained on
the data, and was capable of assigning 80% of
the dependencies correctly. At that stage, the
work of the annotator changed from building
the trees from scratch to checking and
correcting the parses assigned by the parser, except
for the analytical functions, which still had to
be assigned manually. The details related to
the tectogrammatical level are omitted here.
Figure 1 illustrates an example of the
morphological and analytical levels of annotation.
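
A figure such as the 80% above can be read as an attachment accuracy: the
share of words whose parser-assigned head matches the manually annotated
one. The following is a minimal sketch of that computation, assuming each
sentence is represented as a list of head indices. The function name and the
toy data are assumptions, not part of the Prague Dependency Treebank tools.

```python
from typing import Sequence


def attachment_accuracy(gold_heads: Sequence[int],
                        predicted_heads: Sequence[int]) -> float:
    """Share of words whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both analyses must cover the same words")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy data: head index for each of five words, 0 marks the root.
gold = [2, 0, 5, 5, 2]
predicted = [2, 0, 5, 2, 2]
print(attachment_accuracy(gold, predicted))  # 0.8
```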

There are other treebank projects using
the framework developed for the Prague
Dependency Treebank. The Prague Arabic
Dependency Treebank (Hajič et al., 2004),
consisting of around 49,000 tokens of newswire
texts from Arabic Gigaword and the Penn
Arabic Treebank, is a treebank of Modern
Standard Arabic. The Slovene Dependency
Treebank consists of around 500 annotated
sentences obtained from the MULTEXT-East
Corpus (Erjavec, 2005b; Erjavec, 2005a).

<f cap>Do<l>do<t>RR--2----------<A>AuxP<r>1<g>7
<f num>15<l>15<t>C=-------------<A>Atr<r>2<g>4
<d>.<l>.<t>Z:-------------<A>AuxG<r>3<g>2
<f>května<l>květen<t>NNIS2-----A----<A>Adv<r>4<g>1
<f>budou<l>být<t>VB-P---3F-AA---<A>AuxV<r>5<g>7
<f>cestující<l>cestující<t>NNMP1-----A----<A>Sb<r>6<g>7
<f>platit<l>platit<t>Vf--------A----<A>Pred<r>7<g>0
<f>dosud<l>dosud<t>Db-------------<A>Adv<r>8<g>9
<f>platným<l>platný<t>AAIS7----1A----<A>Atr<r>9<g>10
<f>způsobem<l>způsob<t>NNIS7-----A----<A>Adv<r>10<g>7
<d>.<l>.<t>Z:-------------<A>AuxK<r>11<g>0

Figure 1: A morphologically and analytically annotated sentence from the Prague Dependency
Treebank.

2.2.2 TIGER Treebank

The TIGER Treebank of German
(Brants et al., 2002) was developed
based on the NEGRA Corpus
(Skut et al., 1998) and consists of
complete articles covering diverse topics
collected from a German newspaper. The
treebank has around 50,000 sentences. The
syntactic annotation, combining both
phrase-structure and dependency representations, is
organized as follows: phrase categories are
marked in non-terminals, POS information
in terminals and syntactic functions in the
edges. The syntactic annotation is rather
simple and flat in order to reduce the number
of attachment ambiguities. An interesting
feature of the treebank is that a MySQL
database is used for storing the annotations,
from where they can be exported into the
NEGRA Export and TIGER-XML file formats,
which makes the treebank usable and
exchangeable with a range of tools.

The annotation tool Annotate was employed
in creating the treebank, with two methods:
interactive annotation and Lexical-Functional
Grammar (LFG) parsing. LFG parsing is a
typical semi-automated annotation method,
consisting of processing the input texts with a
parser, after which a human annotator
disambiguates and corrects the output. In the case
of the TIGER Treebank, a broad-coverage LFG
parser is used, producing the constituent and
functional structures for the sentences. As
almost every sentence is left with unresolved
ambiguities, a human annotator is needed to
select the correct parse from the set of
possible parses. As each sentence of the corpus has
several thousand possible LFG
representations, a mechanism for automatically reducing
the number of parses is applied, dropping the
number of parses presented to the human
annotator to 17 on average. Interactive
annotation is also a type of semi-automated
annotation, but in contrast to human post-editing,
the method lets the parser and the
annotator interact. First, the parser annotates
a small part of the sentence and the
annotator either accepts or rejects it based on visual
inspection. The process is repeated until the
sentence is annotated completely.
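
The interactive method can be sketched as a loop in which a parser proposes
an analysis for one part of the sentence at a time and the annotator accepts
or rejects it. This is a minimal sketch and not the Annotate tool itself:
the function names, the callbacks, and the toy data are assumptions. In
particular, offering the next-ranked proposal after a rejection is an
assumption made here, since the text does not describe what happens after a
rejection.

```python
from typing import Callable, Dict, Iterable, List


def annotate_interactively(
    parts: List[str],
    propose: Callable[[str], Iterable[str]],
    accept: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Collect one accepted analysis per sentence part.

    propose(part) yields candidate analyses, best first (hypothetical parser).
    accept(part, candidate) models the annotator's decision.
    """
    accepted: Dict[str, str] = {}
    for part in parts:
        for candidate in propose(part):
            if accept(part, candidate):
                accepted[part] = candidate
                break
        else:
            # No proposal was accepted; leave the part for manual annotation.
            accepted[part] = "UNRESOLVED"
    return accepted


# Toy stand-ins for the parser and the annotator.
candidates = {"the new treebank": ["S", "NP"], "was released": ["VP"]}
annotator_choice = {"the new treebank": "NP", "was released": "VP"}

result = annotate_interactively(
    list(candidates),
    lambda part: candidates[part],
    lambda part, cand: annotator_choice[part] == cand,
)
print(result)
```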

2.2.3 Arboretum, L'Arboratoire, 
Arborest and Floresta 
Sintá(c)tica

Arboretum of Danish (Bick, 2003),
L'Arboratoire of French,
Floresta Sintá(c)tica of Portuguese
(Afonso et al., 2002), and Arborest
of Estonian (Bick et al., 2005) are "sibling"
treebanks, Arboretum being the "oldest
sister". The treebanks are hybrids with
both constituent and dependency annotation
organized into two separate levels. The levels
share the same morphological tagset. The
dependency annotation is based on
Constraint Grammar (CG) (Karlsson, 1990) and
consists of 28 dependency types. For creating
each of the four treebanks, a CG-based
morphological analyzer and parser were applied.
The annotation process consisted of CG
parsing of the texts followed by conversion to
constituent format, and manual checking of
the structures.

Arboretum has around 21,600 sentences
annotated with dependency tags, and of those,
12,000 sentences have also been marked
with constituent structures (Bick, 2003;
Bick, 2005). The annotation is available in
both TIGER-XML and PENN export formats.
Floresta Sintá(c)tica consists of around 9,500
manually checked (version 6.8, October 15th,
2005) and around 41,000 fully automatically
annotated sentences obtained from a corpus of
newspaper Portuguese (Afonso et al., 2002).
Arborest of Estonian consists of 149 sentences
from newspaper articles (Bick et al., 2005).
The morphosyntactic and CG-based surface
syntactic annotations are obtained from an
existing corpus, which is converted
semi-automatically to Arboretum-style format.

2.2.4 The Dependency Treebank for 
Russian 

The Dependency Treebank for Russian
is based on the Uppsala University
Corpus (Lönngren, 1993). The
texts are collected from contemporary
Russian prose, newspapers, and
magazines (Boguslavsky et al., 2000;
Boguslavsky et al., 2002). The treebank
has about 12,000 annotated sentences. The
annotation scheme is XML-based and
compatible with the Text Encoding Initiative
(TEI) guidelines, except for some added
elements. It consists of 78 syntactic relations,
divided into six subgroups, such as attributive,
quantitative, and coordinative. The annotation
is layered, in the sense that the levels of
annotation are independent and can be
extracted or processed independently.

The creation of the treebank started
by processing the texts with a
morphological analyzer and a syntactic parser,
ETAP (Apresjan et al., 1992), and was
followed by post-editing by human annotators.
Two tools are available for the annotator:
a sentence boundary markup tool and a
post-editor. The post-editor offers the annotator
functions for building, editing, and
managing the annotations. The editor has a special
split-and-run mode, used when the parser
fails to produce a parse or creates a parse with
a high number of errors. In this mode the user
can pre-chunk the sentence into smaller pieces
to be input to the parser. The parsed chunks
can be linked by the annotator, thus
producing a full parse for the sentence. The tool also
lets the annotator mark the annotation of
any word or sentence as doubtful, as a
reminder that it needs later revision.



2.2.5 Alpino 

The Alpino Treebank of Dutch,
consisting of 6,000 sentences, is targeted mainly at
parser evaluation and comprises
newspaper articles (van der Beek et al., 2002). The
annotation scheme is taken from the CGN
Corpus of spoken Dutch (Oostdijk, 2000) and
the annotation guidelines are based on the
TIGER Treebank's guidelines.



The annotation process in the Alpino
Treebank starts with applying a parser based
on Head-Driven Phrase Structure Grammar
(HPSG) (Pollard and Sag, 1994) and is
followed by a manual selection of the correct
parse trees. Two tools, an interactive lexical
analyzer and a constituent marker, are employed
to restrict the number of possible parses. The
interactive lexical analyzer tool lets the user
assign each word in a sentence to one of the
categories correct, good, or bad. 'Correct'
denotes that the parse includes the lexical entry
in question, 'good' that the parse may include
the entry, and 'bad' that the entry is incorrect.
The parser uses this manually reduced set of
entries, thus generating a smaller set of
possible parses. With the constituent marker tool,
the annotator can mark constituents and their
types in sentences, thus aiding the parser.

The selection of the correct parse is done
with the help of a parse selection tool, which
calculates maximal discriminants to help the
annotator. There are three types of discriminants.
Maximal discriminants are sets of shortest
dependency paths encoding differences between
parses, lexical discriminants represent
ambiguities resulting from lexical analysis, and
constituent discriminants group words into
constituents without specifying the type of the
constituent. The annotator marks each of the
maximal discriminants as good or bad, and
the tool narrows down the number of possible
parses based on this information. If the parse
resulting from the selection is not correct, it
can be edited with a parse editor tool.
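
The idea of narrowing down parses with discriminants can be sketched as
follows. This is a deliberate simplification and not the Alpino
implementation: single dependency triples stand in for the maximal
discriminants, which the text above describes as sets of shortest dependency
paths. All names and the example parses are assumptions.

```python
from typing import FrozenSet, List, Set, Tuple

Dependency = Tuple[str, str, str]  # (head, relation, dependent)
Parse = FrozenSet[Dependency]


def discriminants(parses: List[Parse]) -> Set[Dependency]:
    """Dependencies found in some, but not all, of the candidate parses."""
    if not parses:
        return set()
    in_any = set().union(*parses)
    in_all = set(parses[0]).intersection(*parses[1:])
    return in_any - in_all


def filter_parses(parses: List[Parse],
                  good: Set[Dependency],
                  bad: Set[Dependency]) -> List[Parse]:
    """Keep parses containing every 'good' and no 'bad' discriminant."""
    return [p for p in parses if good <= p and not (bad & p)]


# Two candidate parses differing only in the attachment of "with".
shared = {("saw", "subj", "I"), ("saw", "obj", "man")}
parse_a = frozenset(shared | {("saw", "mod", "with")})
parse_b = frozenset(shared | {("man", "mod", "with")})

print(sorted(discriminants([parse_a, parse_b])))
remaining = filter_parses([parse_a, parse_b],
                          good=set(),
                          bad={("man", "mod", "with")})
print(len(remaining))  # 1
```

Each decision by the annotator removes every parse that conflicts with it,
so a few good or bad marks can be enough to isolate a single parse.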



2.2.6 The Danish Dependency 
Treebank 



The annotation of the Danish
Dependency Treebank is based on
Discontinuous Grammar, which is a
formalism closely related to Word
Grammar (Kromann, 2003). The treebank
consists of 5,540 sentences covering a wide
range of topics. The morphosyntactic
annotation is obtained from the PAROLE
Corpus (Keson and Norling-Christensen, 2005),
thus no morphological analyzer or POS tagger
is applied. The dependency links are marked
manually by using a command-line interface
with a graphical parse view. A parser for
automatically assigning the dependency links
is under development.

2.2.7 METU-Sabanci Turkish 
Treebank 

The morphologically and syntactically
annotated Turkish Treebank consists of 5,000
sentences obtained from the METU
Turkish Corpus (Atalay et al., 2003), covering
16 main genres of present-day written
Turkish (Oflazer et al., 2003). The
annotation is presented in a format that
conforms to the XML-based
Corpus Encoding Standard (XCES)
format (Ide and Romary, 2003). Due to the
morphological complexity of Turkish,
morphological information is not encoded with a fixed
set of tags, but as sequences of inflectional
groups (IGs). An IG is a sequence of
inflectional morphemes, divided by derivation
boundaries. The dependencies between IGs
are annotated with the following 10 link types:
subject, object, modifier, possessor, classifier,
determiner, dative adjunct, locative adjunct,
ablative adjunct, and instrumental adjunct.
Figure 2 illustrates a sample annotated
sentence from the treebank.

The annotation, directed by the guidelines,
is done in a semi-automated fashion, although
a relatively large amount of manual work
remains. First, a morphological analyzer based
on the two-level morphology model
(Oflazer, 1994) is applied to the texts. The
morphologically analyzed and preprocessed
text is input to an annotation tool. The
tagging process requires two
steps: morphological disambiguation and
dependency tagging. The annotator selects the
correct tag from the list of tags proposed by
the morphological analyzer. After the whole
sentence has been disambiguated, dependency
links are specified manually. The annotators
can also add notes and modify the list of
dependency link types.

2.2.8 The Basque Dependency
Treebank

The Basque Dependency Treebank
(Aduriz et al., 2003) consists of
3,000 manually annotated sentences from
newspaper articles. The syntactic tags are
organized as a hierarchy. The annotation
is done with the aid of an annotation tool
with tree visualization and automatic tag
syntax checking capabilities.

2.2.9 The Turin University Treebank

The Turin University Treebank for Italian,
consisting of 1,500 sentences, is divided
into four sub-corpora (Lesmo et al., 2002;
Bosco, 2000; Bosco and Lombardo, 2003).
The majority of the texts are from the civil
law code and newspaper articles. The annotation
format is based on the Augmented Relational
Structure (ARS). The POS tagset consists
of 16 categories and 51 subcategories. There
are around 200 dependency types, organized
as a taxonomy of five levels. The scheme
provides the annotator with the possibility
of marking a relation as under-specified if a
correct relation type cannot be determined.

The annotation process consists of
automatic tokenization, morphological analysis
and POS disambiguation, followed by
syntactic parsing (Lesmo et al., 2002). The
annotator can interact with the parser through a
graphical interface, in a similar way to the
interactive method in the TIGER Treebank.
The annotator can either accept or reject the
suggested tags for each word in the sentence,
after which the parser proceeds to the next
word (Bosco, 2000).
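
A taxonomy of dependency types that allows under-specified relations, as
described for the Turin University Treebank, can be sketched as a parent map
over relation labels. The labels and function names below are hypothetical
and are not the actual TUT relation inventory. The sketch only shows how an
annotator could fall back to a more general label and how such a label can
still be matched against a specific one.

```python
from typing import Dict, List

# Hypothetical three-level fragment of a relation taxonomy (child -> parent).
PARENT: Dict[str, str] = {
    "ARG": "DEP",
    "MOD": "DEP",
    "SUBJ": "ARG",
    "OBJ": "ARG",
    "TIME": "MOD",
    "PLACE": "MOD",
}


def ancestors(label: str) -> List[str]:
    """Return the label followed by all of its more general labels."""
    chain = [label]
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def subsumes(general: str, specific: str) -> bool:
    """True if 'general' equals 'specific' or lies above it in the taxonomy."""
    return general in ancestors(specific)


# An annotator unsure between SUBJ and OBJ can assign the shared parent ARG.
print(ancestors("SUBJ"))        # ['SUBJ', 'ARG', 'DEP']
print(subsumes("ARG", "SUBJ"))  # True
print(subsumes("SUBJ", "ARG"))  # False
```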
Figure 2: A sample sentence from the METU-Sabanci Treebank.



2.2.10 The Dependency Treebank of 
English 

The Dependency Treebank of English
consists of dialogues between a travel agent and
customers (Rambow et al., 2002), and is the
only dependency treebank with spoken
language annotation. The treebank has about
13,000 words. The annotation is a direct
representation of lexical predicate-argument
structure; thus arguments and adjuncts are
dependents of their predicates and all
function words are attached to their lexical heads.
The annotation is done at a single syntactic
level, without a separate representation for
surface syntax, the aim being to keep the
annotation process as simple as possible.
Figure 3 shows an example of an annotated
sentence (Rambow et al., 2002).



The trained annotators have access to an
on-line manual and work from the transcribed
speech without access to the speech files. The
dialogues are parsed with a dependency parser,
the Supertagger and Lightweight Dependency
Analyzer (Bangalore and Joshi, 1999). The
annotators correct the output of the parser
using a graphical tool, the one developed by
the Prague Dependency Treebank project. In
addition to the standard tag editing options,
annotators can add comments. After the editing
is done, the sentence is automatically checked
for inconsistencies, such as differences between
surface and deep roles or prepositions missing
objects.
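
An automatic check of the kind mentioned above, finding prepositions that
lack an object, can be sketched as follows. The sketch assumes that a
preposition governs its object and that prepositions carry the POS tag
"PREP". Both are assumptions made for illustration, as are all names, and
the treebank's own attachment conventions and tagset may differ.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    index: int  # 1-based position in the sentence
    form: str   # word form
    pos: str    # part-of-speech tag (illustrative tagset)
    head: int   # index of the governing word, 0 for the root


def prepositions_without_object(nodes: List[Node],
                                prep_tag: str = "PREP") -> List[Node]:
    """Return prepositions that govern no other word in the sentence."""
    governors = {n.head for n in nodes}
    return [n for n in nodes
            if n.pos == prep_tag and n.index not in governors]


# "a flight to Boston", first attached consistently, then with an error.
consistent = [
    Node(1, "a", "DET", 2),
    Node(2, "flight", "NOUN", 0),
    Node(3, "to", "PREP", 2),
    Node(4, "Boston", "NOUN", 3),
]
faulty = consistent[:3] + [Node(4, "Boston", "NOUN", 2)]

print(prepositions_without_object(consistent))  # []
print([n.form for n in prepositions_without_object(faulty)])  # ['to']
```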

2.2.11 DEPBANK 

As the name suggests, the PARC
700 Dependency Bank (DEPBANK)
(King et al., 2003) consists of 700
annotated sentences from the Penn Wall Street
Journal Treebank (Marcus et al., 1994).
There are 19 grammatical relation types
(e.g. subject, object, modifier) and 37
feature types (e.g. number (pl/sg), passive
(+/-), tense (future/past/present)) in the
annotation scheme.

The annotation process is semi-automatic,
consisting of parsing with a broad-coverage LFG
parser, converting the parses to the DEPBANK
format and manually checking and correcting
the resulting structures. The annotations are
checked by a tool that looks, for example, at
the correctness of the header information, the
syntax of the annotation, and inconsistencies in
feature names. The checking tool helps in
two different ways: first, when the
annotator makes corrections to the parsed structure,
it makes sure that no errors were added, and
second, the tool can detect erroneous parses
and report them to the annotator.

3 Analysis 

Table 1 summarizes some key properties of
the existing dependency treebanks. The size
of the treebanks is usually quite limited,
ranging from a few hundred to 90,000 sentences.
This is partly due to the fact that even the
most long-lived of the dependency treebank
projects, the Prague Dependency Treebank,
was started less than 10 years ago. The
treebank producers have in most cases aimed at
creating a multipurpose resource for
evaluating and developing NLP systems and for
studies in theoretical linguistics. Some are built
for specific purposes, e.g. the Alpino
Treebank of Dutch is mainly for parser evaluation.
Most of the dependency treebanks consist of
written text; to our knowledge there is only
one that is based on a collection of spoken
utterances. The written texts are most
commonly obtained from newspaper articles, and
in the cases of e.g. the Czech, German, Russian,
Turkish, Danish, and Dutch treebanks from
an existing corpus. Annotation usually
consists of POS and morphological levels
accompanied by dependency-based syntactic
annotation. In the case of the Prague Dependency
Treebank a higher, semantic layer of
annotation is also included.

Figure 3: The sentence "The flight will have been booked" from the English treebank. The
words are marked with the word form (first line), the POS (second line), and the surface role
(third line). In addition, node 'flight' is marked with a deep role (DRole) and the root node as
passive in the FRR feature, not set in any other nodes.

The definition of the annotation schema
is always a trade-off between the accuracy
of the representation, data coverage and
cost of treebank development (Bosco, 2000;
Bosco and Lombardo, 2003). The selection of
the tagsets for annotation is critical. Using
a large variety of tags provides high
accuracy and specialization in the description, but
makes the annotators' work even more
time-consuming. In addition, for some
applications, such as training of statistical parsers,
highly specific annotation easily leads to a
data sparsity problem. On the other hand, if
annotation is done at a highly general level, the
annotation process is faster, but naturally a lot
of information is lost. The TUT and Basque
treebanks try to tackle the problem by
organizing the set of grammatical relations into a
hierarchical taxonomy. The intended
application of the treebank may also affect the
annotation choices. A treebank for
evaluation allows for some remaining ambiguities
but no errors, while the opposite may be true
for a treebank for training (Abeillé, 2003). In
annotation consisting of multiple levels, clear
separation between the levels is a concern.
The format of the annotation is also directed
by the specific language that the treebank
is being developed for. The format must
be suited for representing the structures of
the language. For example, in the
METU-Sabanci Treebank a special type of
morphological annotation scheme was introduced due
to the complexity of Turkish morphology.

Semi-automated creation combining
parsing and human checking is the state-of-the-art
annotation method. None of the dependency
treebanks has been created completely manually; at
least an annotation tool capable of visualizing
the structures is used by each of the projects.
Obviously, the reason that there are no
fully automatically created dependency
treebanks is that no parser of free
text is capable of producing error-free parses.

The most common way of combining
human and machine labor is to let the human
work as a post-checker of the parser's output.
Although it is the most straightforward to
implement, the method has some pitfalls. First,
starting annotation with parsing can lead to a
high number of unresolved ambiguities, making
the selection of the correct parse a time-consuming
task. Thus, a parser applied for treebank
building should perform at least some
disambiguation to ease the burden of annotators.
Second, the work of a post-checker is mechanical
and there is a risk that the checker just
accepts the parser's suggestions without a
rigorous inspection. A solution followed e.g.
by both of the treebanks for English and by the
Basque treebank is to apply a post-checking
tool to the created structures before
accepting them. Some variants of semi-automated
annotation exist: the TIGER, TUT, Alpino,
and the Russian Treebanks apply a method
where the parser and the annotator can
interact. The advantage of the method is that
when the errors made by the parser are corrected
by the human at the lower levels, they do not
multiply into the higher levels, thus making it
more probable that the parser produces a
correct parse. In some annotation tools, such as
the tools of the Russian and the English
dependency treebanks, the annotator is provided
with the possibility of adding comments to the
annotation, easing the further inspection of
doubtful structures. In the annotation tool of
the TUT Treebank, a special relation type can
be assigned to mark doubtful annotations.

Table 1: Comparison of dependency treebanks. (*Due to the limited number of pages, not all the treebanks
in the Arboretum "family" are included in the table. **Information on the number of utterances was not available.
M=manual, SA=semi-automatic, TB=treebank)

Although more collaboration has emerged
between treebank projects in recent years, the
main problem with current treebanks with
regard to their use and distribution is the fact
that instead of reusing existing formats, new
ones have been developed. Furthermore, the
schemes have often been designed from
theory-specific and even application-specific
viewpoints, which undermines the possibility
of reuse. Considering the high costs of
treebank development (for example, in the case
of the Prague Dependency Treebank an
estimated USD 600,000 (Böhmová et al., 2003)),
reusability of tools and formats should have
a high priority. In addition to the
difficulties for reuse, creating a
treebank-specific representation format requires
developing a new set of tools for
creating, maintaining and searching the
treebank. Yet, the existence of exchange formats
such as XCES (Ide and Romary, 2003)
and TIGER-XML (Mengel and Lezius, 2000)
would allow multipurpose tools to be created
and used.

4 Conclusion 

We have introduced the state of the art in
dependency treebanking and discussed the main
characteristics of current treebanks. The
findings reported in the paper will be used in
designing and constructing an annotation tool
for dependency treebanks and in constructing a
treebank for Finnish for syntactic parser
evaluation purposes. The choice of the dependency
format for a treebank for evaluating syntactic
parsers of Finnish is self-evident, Finnish
being a language with relatively free word order
and all parsers for the language working in the
dependency framework. The annotation
format will be one of the existing XML-based
formats, allowing existing tools to be applied
for searching and editing the treebank.

The findings reported in this paper indicate 
that the following key properties must be im- 
plemented into the annotation tool for creat- 
ing the treebank for Finnish: 

• An interface to a morphological analyzer 
and parser for constructing the initial 
trees. Several parsers can be applied in 
parallel to offer the annotator a possibil- 
ity to compare the outputs. 

• Support for an existing XML annotation 
format. Using an existing format will 
make the system more reusable. XML- 
based formats offer good syntax-checking 
capabilities. 

• An inconsistency checker. Before being
saved, the annotated sentences will be
checked for errors in the tags and in the
annotation format. In addition to XML-based
validation of the syntax of the annotation, the
inconsistency checker will inform the
annotator about several other types of
mistakes. The POS and morphological tags
will be checked to find any mismatching
combinations. Problems such as a missing
main verb or a fragmented, incomplete
parse will be indicated to the user.

• A comment tool. The annotator will be 
able to add comments to the annotations 
to aid later revision. 

• Menu-based tagging. In order to mini- 
mize errors, instead of typing the tags, 
the annotator will only be able to set tags 
by selecting them from predefined lists. 
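
The first property in the list above, applying several parsers in parallel
so that the annotator can compare their outputs, can be sketched as a
function that reports the words on which the parsers disagree. This is a
minimal sketch under stated assumptions: each parser output is reduced to a
list of head indices, and the parser names and the function name are
hypothetical.

```python
from typing import Dict, List


def disagreements(outputs: Dict[str, List[int]]) -> List[int]:
    """Return 1-based word positions where the proposed heads differ."""
    lengths = {len(heads) for heads in outputs.values()}
    if len(lengths) > 1:
        raise ValueError("all parsers must analyze the same words")
    positions = []
    for position, heads in enumerate(zip(*outputs.values()), start=1):
        if len(set(heads)) > 1:
            positions.append(position)
    return positions


# Head index per word from two hypothetical parsers, 0 marks the root.
outputs = {
    "parser_a": [2, 0, 2, 3],
    "parser_b": [2, 0, 2, 2],
}
print(disagreements(outputs))  # [4]
```

An annotation tool could use such a list to direct the annotator's attention
to the words where the initial trees differ.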